The psychological method is on the verge of its next revolution. Some forty years ago psychologists abandoned the cages with pigeons and rats and instead started studying humans by measuring their response times and giving them questionnaires. Now the response boxes and questionnaires are obsolete; virtual reality has taken their place. Why now? Why did it take so long? There are two aspects to this. The hardware has become cheap and powerful enough to support immersive experiences. On the software side, game engines like Unity and UE4 have become accessible (in terms of ease of use and price tag) and allow researchers to develop complex virtual worlds. The technology still requires further evolution, but most of the issues (such as insufficient resolution and frame rate) are well known and are being tackled. A few issues, however, have been neglected and need to be resolved before the technology can be used in research.

In this post I consider binaural audio. Wikipedia gives a good introduction to binaural audio; this tutorial by Richard Duda goes into more detail. Binaural audio, when well implemented, can induce a strong feeling of presence, just like VR goggles do via the visual channel. Unfortunately, for whatever reason, modern video games and game engines do not use binaural audio.

In this post I introduce a few simple models that try to create an immersive binaural experience by properly filtering the sound. Let's see how good these solutions are.

First, we will use this sound.


In [1]:
%pylab inline
buzz=np.load('buzz.npy')               # a short buzz sample
fr=44100;tot=buzz.size/float(fr)*1000  # sampling rate (Hz); duration (ms)
plt.plot(np.linspace(0,tot,buzz.size),buzz);
plt.xlabel('time (ms)');plt.ylabel('amplitude')
K=500                                  # number of repetitions
buzz=np.concatenate([buzz]*K)          # repeat to obtain a ~5 second buzz


Populating the interactive namespace from numpy and matplotlib

We repeat the sound a few hundred times to obtain a 5-second buzz. This is what it sounds like. (You need to listen through headphones.)


In [2]:
from embedaudio import Audio as EmbedAudio
def embedLR(left,right=None,rate=44100):
    ''' assumes int16-range samples; interleaves left/right
        and scales the amplitude to [-1,1]'''
    if right is None: right=left
    out=np.concatenate([left[:,np.newaxis],
                        right[:,np.newaxis]],axis=1)
    out=out.flatten()/32767.
    return EmbedAudio(out,rate=rate,nrchannels=2)
embedLR(buzz,rate=fr)


Out[2]:

A naive approach to creating the impression of left-right motion is to cross-fade the sound from the left to the right channel.


In [3]:
padd=0.2                              # fraction of samples held constant at each end
i0=int(padd*buzz.size); i1=int((1-padd)*buzz.size)
movelr=np.ones(buzz.size)
movelr[i0:i1]=np.linspace(0,1,i1-i0)  # linear cross-fade in the middle section
movelr[:i0]=0
plt.plot(movelr[::-1])
plt.plot(movelr,'r');plt.ylim([-0.1,1.1]);
plt.legend(['left','right'],loc=9)
plt.xlabel('time');plt.ylabel('amplitude');

embedLR(left=buzz*movelr[::-1],right=buzz*movelr,rate=fr)


Out[3]:

This gives the impression of a sound source traveling inside the head. This is not what we want.

In order to locate a sound source precisely on the horizontal plane, people rely (among other things) on the difference in arrival times between the two ears. The sound from a source located to the right arrives sooner at the right ear, since the distance to the right ear is shorter. The figure below illustrates the relationship between the horizontal angle (aka azimuth) of the sound source and the arrival times.

Let $2a$ be the inter-ear distance in meters and let $\theta$ denote the azimuth angle with respect to the head direction. (We will also assume that the head is a sphere.) The sound travels an additional distance of $a\theta$ to the left ear, while the distance to the right ear is shorter by $a \sin \theta$ meters. Hence the sound arrives $\frac{a}{c} \sin \theta$ seconds earlier at the right ear and $\frac{a}{c} \theta$ seconds later at the left ear, where $c = 343$ m/s is the speed of sound. For $a=0.1$ meters and a sound source at 90 degrees this gives an interaural time difference (ITD) of $\frac{a(1+\pi/2)}{c} = 0.75$ ms. It is easy to create a sound that follows this ITD model.
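Before doing that, here is a quick arithmetic check of the 0.75 ms figure (a standalone snippet; numpy is assumed to be imported as np, as in the cells above).

a, c = 0.1, 343.0                 # head radius (m) and speed of sound (m/s)
theta = np.pi/2                   # source directly to the right
itd = a*(np.sin(theta)+theta)/c   # near-ear lead plus far-ear lag
print(itd*1000)                   # ~0.75 ms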


In [4]:
def theta2dt(theta,ied=0.2):
    ''' theta - N vector with azimuth angle 
            wrt head direction in radians [-pi,pi]
        ied - inter-ear distance in meters
        returns Nx2 array with dt in seconds for 
            the left and right ear respectively
    '''
    if np.any(np.abs(theta)>np.pi): 
        raise ValueError('theta>|pi|')
    C=ied/2./343.  # head radius divided by the speed of sound
    def _help(phi):
        # phi - angle between the source and the ear axis
        sel=np.abs(phi)<=np.pi/2.  # source on the ear's side of the head
        out=np.zeros(phi.size)
        out[sel]= -np.cos(phi[sel])
        out[~sel]=np.minimum(np.abs(phi[~sel])-np.pi/2,
                    -np.abs(phi[~sel])+3*np.pi/2 )  # path wraps around the head
        return out*C
    dt=np.zeros((theta.size,2))
    dt[:,0]=_help(theta+np.pi/2) #left
    dt[:,1]=_help(theta-np.pi/2) #right
    return dt
theta=movelr*np.pi-np.pi/2.  # azimuth sweep from -90 to +90 degrees
dt=theta2dt(theta)
plt.plot(theta/np.pi*180,dt*1000); 
plt.legend(['left','right'],loc=9)
plt.gca().set_xticks(np.linspace(-90,90,9))
plt.xlabel('theta [degrees]'); plt.ylabel('dt [ms]');
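As a sanity check (assuming the cell above has been run), the model reproduces the 0.75 ms ITD for sources at +90 and -90 degrees:

chk=theta2dt(np.array([np.pi/2,-np.pi/2]))   # sources at +90 and -90 degrees
print(chk*1000)                              # per-ear dt in ms
print(np.abs(chk[:,0]-chk[:,1])*1000)        # ITD, ~0.75 ms in both cases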


The figure above shows how much earlier or later the sound arrives at each ear relative to the head center. Here is what the result of the time shift sounds like.


In [5]:
from scipy.interpolate import interp1d 

t=np.linspace(0,buzz.size/float(fr),buzz.size)
# shift each sample by its per-ear delay, then resample on the original time grid
l=interp1d(t+dt[:,0],buzz,bounds_error=False,fill_value=0)(t)
r=interp1d(t+dt[:,1],buzz,bounds_error=False,fill_value=0)(t)
embedLR(left=l,right=r,rate=fr)


Out[5]:

Another cue that people use to locate a sound source comes from the head-shadow effect. As with light, the head attenuates the sound on its way to the ear farther from the source. At low frequencies this effect is negligible, but at high frequencies it helps people locate the sound source. Assuming that the head is a rigid sphere, it is easy to synthesize this effect by appropriately filtering the signal. The transfer function (in the frequency domain) of the relevant filter is $H(s,\theta)=\frac{(1+\cos\theta)s + \beta}{s+\beta}$, where $\theta$ is the angle between the source and the ear and $\beta = 2c/a$. First, let me show you the amplitude spectrum of our buzz sound.


In [6]:
buzz=np.load('buzz.npy')
Fbuzz=np.fft.rfft(buzz)
s=np.linspace(0,fr/2.,Fbuzz.size)        # frequency axis in Hz
spec= 2/float(buzz.size)*np.abs(Fbuzz)   # amplitude spectrum
plt.plot(s,np.log10(spec));
plt.xlabel('frequency [hz]');plt.ylabel('log10 amplitude');
plt.xlim([0,10000]); plt.title('amplitude spectrum');
plt.ylim([0,5]);
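To get a feel for this filter before computing it for the whole azimuth sweep, here is a quick, purely illustrative evaluation of $H$ for a source at 90 degrees to the right, i.e. $\theta=0$ at the near (right) ear and $\theta=\pi$ at the far (left) ear, with $\beta = 2c/a = 6860$ for $a=0.1$ m (the same value the code below uses):

beta=2*343/0.1                             # beta = 2c/a for a 0.1 m head radius
f=np.array([200.,2000.,10000.])            # test frequencies in Hz
near=((1+np.cos(0.0))*f+beta)/(f+beta)     # ear facing the source
far =((1+np.cos(np.pi))*f+beta)/(f+beta)   # shadowed ear
print(near)   # ~[1.03, 1.23, 1.59] : high frequencies boosted
print(far)    # ~[0.97, 0.77, 0.41] : high frequencies attenuated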



In [7]:
def rigidSphereHRTF(theta,fr,ied):
    ''' theta - angle between the source and the ear in radians;
        uses the global frequency axis s (Hz) from the cell above'''
    C=4*343/ied                      # beta = 2c/a, with head radius a=ied/2
    return ((1+np.cos(theta))*s+C)/(s+C)
ts=np.linspace(0,buzz.size/float(fr),buzz.size)
thsub=theta[np.int32(np.linspace(0,theta.size-1,K))]  # one azimuth per buzz repetition

hrtfl=[];hrtfr=[]
for th in thsub.tolist():
    hrtfl.append(rigidSphereHRTF(th+np.pi/2,fr,ied=0.2))
    hrtfr.append(rigidSphereHRTF(th-np.pi/2,fr,ied=0.2))
hrtfl=np.array(hrtfl);hrtfr=np.array(hrtfr)
[X,Y]=np.meshgrid(thsub,s)
plt.pcolor(X,Y,hrtfl.T)
plt.xlim([thsub[0],thsub[-1]])
plt.ylim(s[0],s[-1]);plt.colorbar();
plt.xlabel('theta [rad]');plt.ylabel('hz');


The figure above shows the magnitude of the filter (here for the left ear) as a function of the source azimuth. High frequencies are amplified at the near ear and attenuated at the far ear. Recalling that filtering/convolution in the time domain is equivalent to multiplication in the frequency domain, we create the sound with the head-shadow effect as follows.


In [8]:
ildl=[]
ildr=[]
for i in range(thsub.size):
    # filter each buzz repetition with the HRTF for its azimuth
    # (multiplication in the frequency domain, then back to the time domain)
    ildl.append(np.fft.irfft(Fbuzz*hrtfl[i,:]))
    ildr.append(np.fft.irfft(Fbuzz*hrtfr[i,:]))
ildl=np.concatenate(ildl);ildr=np.concatenate(ildr)

embedLR(left=ildl,right=ildr,rate=fr)


Out[8]:

However, to get the best result we need to combine the head-shadow (HS) model and the ITD model. Using just one of them (as we did in the examples above) gives conflicting cues to our auditory system; in particular, the samples above may give the impression of two separate sound sources.

The complete spherical head model sounds like this.


In [9]:
t=np.linspace(0,ildl.size/float(fr),ildl.size)
# apply the ITD time shifts to the head-shadow filtered channels
l=interp1d(t+dt[:,0],ildl,bounds_error=False,fill_value=0)(t)
r=interp1d(t+dt[:,1],ildr,bounds_error=False,fill_value=0)(t)
embedLR(left=l,right=r,rate=fr)


Out[9]:

Of course, the head is not a sphere, and the body also alters the sound wave. Most crucially, the intricate shape of the pinna allows people to infer the elevation of a sound source (albeit with less precision than its azimuth). None of the sound samples in this post provides any elevation cues; most probably, you were at moments confused about whether the sound came from the front or from the back. To create elevation cues we need to incorporate the effect of the pinna into our sound-generation model.
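For completeness, here is a minimal sketch of how such a model is typically applied in practice: with a pair of measured head-related impulse responses for a given direction (for example from the MIT KEMAR or CIPIC databases; the file names below are hypothetical placeholders), rendering reduces to one convolution per channel.

# hypothetical files with measured left/right impulse responses
# for one (azimuth, elevation) direction
hrir_l=np.load('hrir_left.npy')
hrir_r=np.load('hrir_right.npy')
src=np.load('buzz.npy').astype(float)
left =np.convolve(src,hrir_l)[:src.size]   # pinna, head and torso filtering
right=np.convolve(src,hrir_r)[:src.size]
# embedLR(left=left,right=right,rate=44100)  # listen as before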